
large-scale problem on a huge data set [53]. We propose to solve the problem within the backpropagation framework based on two considerations: 1) the inference process of the optimized model operates on the quantized variables, which means that the variables must be quantized in the forward pass (corresponding to inference) during training and the loss must be calculated from the quantized variables; 2) the backpropagation does not need to be quantized, but it must fully account for the relationship between the quantized variables and their full-precision counterparts. Based on these considerations, we propose that in the $k$th iteration, using the projection in Eq. 3.29, $x^{[k]}$ is quantized to $\hat{x}^{[k]}$ in the forward pass as

$$\hat{x}^{[k]} = P_\Omega\bigl(\omega, x^{[k]}\bigr), \tag{3.30}$$
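As an illustration of this forward quantization step, the following minimal NumPy sketch assumes that $\Omega$ is a small set of discrete levels (here $\{-1, +1\}$) and that $P_\Omega(\omega, x)$ scales $x$ by $\omega$ elementwise and snaps each entry to the nearest element of $\Omega$; the actual projection is the one defined in Eq. 3.29, and all names below are illustrative.

```python
import numpy as np

# Hypothetical discrete set Omega, e.g. binary levels {-1, +1}.
OMEGA = np.array([-1.0, 1.0])

def project_onto_omega(omega, x, levels=OMEGA):
    """Illustrative P_Omega(omega, x): scale x by omega (elementwise here) and
    snap every entry to the nearest level in Omega.  The actual projection of
    Eq. 3.29 may combine omega and x differently."""
    z = omega * x                                    # scaled latent variable
    idx = np.abs(z[..., None] - levels).argmin(-1)   # index of nearest level
    return levels[idx]

# Forward pass of iteration k (Eq. 3.30): quantize the latent kernel x^[k].
rng = np.random.default_rng(0)
x_k = rng.standard_normal((3, 3))     # latent (full-precision) kernel
omega = np.full_like(x_k, 0.75)       # projection matrix, same shape as x here
x_hat_k = project_onto_omega(omega, x_k)
print(x_hat_k)                        # every entry is an element of OMEGA
```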

The quantized variable $\hat{x}^{[k]}$ is then used to improve the backpropagation process by defining an objective as

$$\min \; f(\omega, x) \quad \text{s.t.} \quad \hat{x}^{[k]}_{j} = P^{j}_{\Omega}(\omega_j, x), \tag{3.31}$$

where $\omega_j$, $j \in \{1, \ldots, J\}$, is the $j$th projection matrix$^3$, and $J$ is the total number of projection matrices. To solve the problem in (3.31), we define our update rule as

$$x = x^{[k]} - \eta\,\delta^{[k]}_{\hat{x}}, \tag{3.32}$$

where the superscript $[k+1]$ is omitted from $x$, $\delta^{[k]}_{\hat{x}}$ is the gradient of $f(\omega, x)$ with respect to $x$ at $x = \hat{x}$, and $\eta$ is the learning rate. The quantization process $\hat{x}^{[k]} \leftarrow x^{[k]}$, that is, $P^{j}_{\Omega}(\omega_j, x^{[k]})$, is equivalent to finding the projection of $\omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr)$ onto $\Omega$ as

$$\hat{x}^{[k]} = \arg\min_{\hat{x}} \Bigl\{ \bigl\| \hat{x} - \omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) \bigr\|^2, \; \hat{x} \in \Omega \Bigr\}. \tag{3.33}$$

Obviously, $\hat{x}^{[k]}$ is the solution to the problem in (3.33). Thus, by incorporating (3.33) into $f(\omega, x)$, we obtain a new formulation of (3.31) based on the Lagrangian method as

$$\min \; f(\omega, x) + \frac{\lambda}{2} \sum_{j}^{J} \bigl\| \hat{x}^{[k]} - \omega_j\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) \bigr\|^2. \tag{3.34}$$

The newly added term (the second term in (3.34)) is a quadratic function and is referred to as the projection loss.
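To make the projection loss concrete, the following sketch evaluates the penalty of Eq. 3.34 for a single projection matrix ($J = 1$) together with the update of Eq. 3.32. It is an illustration only: the sign-based stand-in for $P_\Omega$, the values of $\eta$ and $\lambda$, and the placeholder gradient `delta_xhat` are assumptions, not part of the method's definition.

```python
import numpy as np

def projection_loss(x_hat_k, omega, x, delta_xhat, eta, lam):
    """Quadratic penalty of Eq. 3.34 for a single projection matrix (J = 1);
    for J > 1 this term is summed over the J projection matrices."""
    residual = x_hat_k - omega * (x + eta * delta_xhat)
    return 0.5 * lam * np.sum(residual ** 2)

# One illustrative iteration k.
rng = np.random.default_rng(0)
eta, lam = 0.01, 1e-4
x = rng.standard_normal((3, 3))       # latent kernel x^[k]
omega = np.full_like(x, 0.75)         # projection matrix (elementwise here)

# Stand-in for Eq. 3.30 with Omega = {-1, +1}: project omega * x via sign().
x_hat = np.sign(omega * x)

# Placeholder for the gradient of f(omega, x) at x = x_hat (delta in Eq. 3.32).
delta_xhat = rng.standard_normal(x.shape)

penalty = projection_loss(x_hat, omega, x, delta_xhat, eta, lam)
x_next = x - eta * delta_xhat         # Eq. 3.32: update of the latent kernel
# In the full objective (Eq. 3.34) the penalty also contributes a gradient to
# x; only the plain Eq. 3.32 step is shown here.
print(float(penalty), x_next.shape)
```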

3.5.3 Theoretical Analysis

Taking a closer look at the projection loss in Eq. 3.34, we have

$$\hat{x}^{[k]} - \omega\bigl(x + \eta\delta^{[k]}_{\hat{x}}\bigr) = \hat{x}^{[k]} - \omega x - \omega\eta\delta^{[k]}_{\hat{x}}. \tag{3.35}$$

In this case, we consider only one projection function, so the subscript $j$ of $\omega_j$ is omitted for simplicity; the analysis for multiple projections is given afterward. In the forward step, only the discrete kernel values participate in the calculation, so their gradients can be obtained by

$$\frac{\partial f(\omega, \hat{x}^{[k]})}{\partial \hat{x}^{[k]}} = \omega\,\delta^{[k]}_{\hat{x}}, \tag{3.36}$$

$^3$Since the kernel parameters $x$ are represented as a matrix, $\omega_j$, like $\omega$, also denotes a matrix.